
Make text-to-fact criteria directive: exhaustive extraction + typed-literal compound terms #150

Merged
justinjoy merged 1 commit into main from extraction-exhaustiveness on Jun 26, 2026

@justinjoy justinjoy commented Jun 26, 2026

Two prompt-hardening changes to skills/factlog/references/text-to-fact.md,
the authoritative extraction criteria. Both replace a soft "may", which the
extractor reliably treats as discretionary and skips, with a "must, when X".

1. Exhaustive extraction (completeness principle, "완전성 원칙")

Dense tables — participant rosters, financial/registry status, budget line
items, schedules, career/patent records — are the highest-density fact source,
yet the prior criteria only said "record relation candidates." In practice the
extractor skimmed prose and dropped repeated table rows: a real proposal with
~400 extractable facts yielded ~90 (≈20–25% coverage).

  • forbid sampling of repeated items ("just a few representative ones" → extract all N)
  • table → triple mapping rule (row key→subject, header→relation, cell→object)
  • judge coverage by section/table sweep, not converted-file byte size
  • pre-finish self-check, PII exclusions preserved
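
The table → triple mapping rule and the pre-finish self-check can be illustrated with a minimal Python sketch. Everything here is hypothetical and for illustration only: the `Triple` type, the `table_to_triples` and `expected_count` helpers, and the sample budget table are not part of text-to-fact.md or the engine, and PII filtering is left out.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    relation: str
    obj: str


def table_to_triples(header, rows, key_column=0):
    """Sweep every row: row key -> subject, header -> relation, cell -> object."""
    triples = []
    for row in rows:  # no sampling: all N rows are visited
        subject = row[key_column]
        for col, cell in enumerate(row):
            if col == key_column or not cell.strip():
                continue  # skip the key column itself and empty cells
            triples.append(Triple(subject, header[col], cell.strip()))
    return triples


def expected_count(rows, key_column=0):
    """Number of non-empty, non-key cells: the coverage target for one table."""
    return sum(
        1
        for row in rows
        for col, cell in enumerate(row)
        if col != key_column and cell.strip()
    )


# Hypothetical budget table (assumed data, not taken from a real proposal).
header = ["item", "amount", "year"]
rows = [
    ["personnel", "126백만원", "2026"],
    ["equipment", "40백만원", "2026"],
    ["travel", "", "2027"],
]

triples = table_to_triples(header, rows)

# Pre-finish self-check: coverage is judged per table, not by file size.
assert len(triples) == expected_count(rows)
```

The self-check compares emitted triples against the count of non-empty cells in the table, which is one way to judge coverage by table sweep rather than by converted-file byte size.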

2. Typed-literal compound terms (not discretionary)

Date/amount/ordinal/number objects left as prose strings ("2017.03.08",
"126백만원", i.e. 126 million won) can't be sorted or thresholded by the engine.
Left to discretion, the extractor never emits compound terms (observed: 0
across a full sync).

  • require date()/ordinal()/amount()/number() for typed literals, with a
    prose→term mapping table
  • honest engine-support note: date/ordinal fully project; amount is
    positive-int only and needs a unit table (use number() for negatives
    such as a loss); number() projection is still pending (#125, "feat(typed):
    number-type comparison (engine has no float text column)"), but emit it
    anyway for structure
  • cross-reference attribute-relations.md / typed-relations.md so declared
    relations actually project and compare
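
As a rough illustration of the prose→term mapping, the sketch below converts a few prose literals into compound-term strings. Only the functor names date()/ordinal()/amount()/number() come from this PR; the argument shapes (for example `date(2017, 3, 8)` and `amount(126, "백만원")`), the regular expressions, the English ordinal suffixes, and the `to_term` helper are assumptions made for the example, not the syntax defined in text-to-fact.md.

```python
import re

# Assumed input shapes; real documents will need a broader mapping table.
DATE_RE = re.compile(r"^(\d{4})\.(\d{1,2})\.(\d{1,2})$")
ORDINAL_RE = re.compile(r"^(\d+)(?:st|nd|rd|th)$")
NUMBER_RE = re.compile(r"^-?\d+(?:\.\d+)?$")
AMOUNT_RE = re.compile(r"^(-?\d+)\s*(\D+)$")


def to_term(text):
    """Return an assumed compound-term string, or the input if nothing matches."""
    s = text.strip()
    m = DATE_RE.match(s)
    if m:
        year, month, day = (int(g) for g in m.groups())
        return f"date({year}, {month}, {day})"
    m = ORDINAL_RE.match(s)
    if m:
        return f"ordinal({int(m.group(1))})"
    if NUMBER_RE.match(s):
        return f"number({s})"
    m = AMOUNT_RE.match(s)
    if m:
        value, unit = int(m.group(1)), m.group(2).strip()
        if value > 0:
            return f'amount({value}, "{unit}")'
        # amount() is positive-int only, so fall back to number();
        # note that this simple fallback drops the unit.
        return f"number({value})"
    return s  # not a typed literal: leave the prose string untouched


print(to_term("2017.03.08"))
print(to_term("126백만원"))
print(to_term("-40백만원"))
```

Dates are tried first because a dotted date would otherwise be hard to tell apart from other numeric strings; negatives go to number() to respect the positive-int limit on amount() noted above.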

Docs/criteria only — no code paths touched; the file is read at extraction
time so changes are live without reinstall.

Add a "완전성 원칙" (completeness principle) section so extraction sweeps every
section and table row by row instead of skimming prose. Dense tables
(participant rosters, financial/registry status, budget line items, project
schedules, career/patent records) were the main source of silent omissions:
narrative got captured while repeated table rows were dropped.

- forbid sampling ("just a few representative ones") of repeated items
- table → triple mapping rule (row key→subject, header→relation, cell→object)
- judge coverage by section/table sweep, not converted-file byte size
- pre-finish self-check, with existing PII exclusions preserved
@justinjoy justinjoy merged commit fe814a8 into main Jun 26, 2026
3 checks passed
@justinjoy justinjoy deleted the extraction-exhaustiveness branch June 26, 2026 11:30
@justinjoy justinjoy changed the title from "Mandate exhaustive fact extraction in text-to-fact criteria" to "Make text-to-fact criteria directive: exhaustive extraction + typed-literal compound terms" Jun 27, 2026